The PySpark API offers multiple ways of performing aggregation. Aggregating usually shuffles data between partitions. This shuffling is needed to compute the result correctly, but it has an associated cost that can impact performance, since it moves data over the network between Spark tasks.
There are, however, cases where some aggregation methods are more efficient than others. For example, when RDD.groupByKey is used in conjunction with RDD.mapValues and the function passed to RDD.mapValues reduces the grouped values with a commutative and associative operation, it is preferable to use RDD.reduceByKey instead. The performance gain from RDD.reduceByKey comes from the amount of data that needs to be moved between PySpark tasks: RDD.reduceByKey reduces the number of rows in each partition before sending the data over the network for further reduction. With RDD.groupByKey followed by RDD.mapValues, on the other hand, the reduction is only done after the data has been moved around the cluster, slowing down the computation by transferring a larger amount of data over the network.
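As an illustration, here is a minimal sketch contrasting the two approaches; the local SparkContext and the sample key-value pairs are assumptions made for the example:

```python
from operator import add

from pyspark import SparkContext

sc = SparkContext("local[*]", "aggregation-example")  # illustrative local context

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)])

# groupByKey + mapValues: all values for each key are shuffled across the
# network first, and only then summed on the receiving side.
grouped_sums = pairs.groupByKey().mapValues(sum)

# reduceByKey: values are summed within each partition first, so only one
# partial sum per key per partition is shuffled for the final reduction.
reduced_sums = pairs.reduceByKey(add)

# Both produce the same per-key sums; only the shuffle volume differs.
assert sorted(grouped_sums.collect()) == sorted(reduced_sums.collect())
print(sorted(reduced_sums.collect()))  # [('a', 9), ('b', 6)]
```

Because addition is commutative and associative, the partial sums computed within each partition by RDD.reduceByKey can be combined in any order after the shuffle without changing the result.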